Problem Studied


The  broad  goal  of  this  project  is  to  formulate  mathematical  models  which  are  sufficiently 
general  to  support  an  analysis  of  particular  algorithms  for  target  detection  and 
recognition  from  the  perspective  of  classical  statistics  and  information  theory.  When 
approached  from  this  viewpoint,  questions  about  the  performance  of  an  algorithm  for 
detection,  recognition  or  identification  translate  into  familiar  problems  in  estimation, 
complexity  and  hypothesis  testing.  Consequently,  an  arsenal  of  powerful  results  from 
statistics  and  information  theory,  for  example  results  about  optimal  codes,  most-powerful 
tests,  inference  and  efficiency,  and  the  complexity  of  testing  highly  composite 
hypotheses,  can  be  exploited  to  achieve  a  deeper  understanding  of  the  ATR  problem. 

The focus of the research is on performance metrics: various measures of an algorithm's
performance, such as probability of detection, probability of "false alarm," bias/variance
tradeoffs for algorithms that learn from training data, and computational complexity.
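
As a concrete illustration of the first two metrics, the following minimal sketch estimates the probability of detection and the probability of false alarm empirically for a thresholded detector. The function name, the scores, and the labels are hypothetical and are introduced here only for illustration; they are not data from the project.

```python
def detection_rates(scores, labels, threshold):
    """Return (P_D, P_FA) for the rule 'declare a target if score >= threshold'.

    labels: 1 for a true target, 0 for background (hypothetical toy data).
    """
    hits = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    false_alarms = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    n_targets = sum(labels)
    n_background = len(labels) - n_targets
    return hits / n_targets, false_alarms / n_background


scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
for t in (0.2, 0.5, 0.85):
    # Raising the threshold lowers both rates; sweeping it traces a tradeoff curve.
    print(t, detection_rates(scores, labels, t))
```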

Summary  of  Key  Results 

Performance  Bounds  and  Theoretical  Performance  Analysis 

An  explicit  goal  of  the  project  is  the  determination  of  lower  bounds  for  various  measures 
of  performance:  What  is  the  best  performance  that  one  can  possibly  achieve  within  a 
generic  class  of  algorithms? 

D.  Geman  and  colleagues  have  proved  that  coarse-to-fine  (CTF)  testing  is 
computationally  optimal  under  a  certain  statistical  model.  More  specifically,  there  is  a 
certain  classifier  for  a  generic  object  class  (e.g.,  face  or  truck)  and  this  classifier  may  be 
evaluated  (computed)  in  a  great  many  ways.  The  classifier  is  based  on  a  family  of 
“tests.”  Each  way  of  evaluating  the  tests  sequentially  and  adaptively,  stopping  when  the 
state  of  the  classifier  is  determined,  corresponds  to  a  decision  tree.  Each  tree 
(evaluation)  has  therefore  exactly  the  same  error  rates,  as  it  is  merely  a  way  of 
evaluating  a  fixed  function.  But  the  trees  have  widely  varying  computational  costs.  Here 
one  assumes  each  test  has  a  certain  “algorithmic  cost.”  We  have  proven  that  under  a 
certain  model  for  the  joint  distribution  of  the  tests  under  the  hypothesis  of  “no  object,”  a 
“coarse-to-fine” evaluation is optimal in the sense of minimizing expected cost. The
term “coarse-to-fine” refers to the fact that tests have different levels of coarseness in terms of
the number of poses of the object they cover. Another related accomplishment was an
analysis of the false positive rate of the classifier, obtained by making a transformation to
a non-homogeneous branching process and then using more or less classical results.
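
The distinction between a fixed classifier and the many ways of evaluating it can be illustrated with a toy simulation. The sketch below is not the statistical model of the report: it assumes a binary hierarchy of pose cells, unit cost per test, a classifier that fires when some root-to-leaf chain of tests is entirely positive, and independent tests with a common positive rate under the "no object" hypothesis. All names and parameter values are hypothetical.

```python
import random


def make_tests(depth, p_positive, rng):
    """tests[level][i]: outcome of the test for cell i; level 0 is the coarsest."""
    return [[rng.random() < p_positive for _ in range(2 ** lvl)]
            for lvl in range(depth + 1)]


def coarse_to_fine(tests):
    """Start at the coarsest test and descend only below positive tests."""
    depth, count, stack = len(tests) - 1, 0, [(0, 0)]
    while stack:
        lvl, i = stack.pop()
        count += 1
        if tests[lvl][i]:
            if lvl == depth:
                return True, count          # complete chain of positive tests
            stack += [(lvl + 1, 2 * i), (lvl + 1, 2 * i + 1)]
    return False, count


def fine_to_coarse(tests):
    """For each pose, test from the finest cell upward; cached tests are free."""
    depth, seen = len(tests) - 1, set()
    for leaf in range(2 ** depth):
        chain_ok = True
        for lvl in range(depth, -1, -1):
            i = leaf >> (depth - lvl)       # index of the ancestor cell at this level
            seen.add((lvl, i))
            if not tests[lvl][i]:
                chain_ok = False
                break
        if chain_ok:
            return True, len(seen)
    return False, len(seen)


rng = random.Random(0)
cost_ctf = cost_ftc = 0
for _ in range(200):
    tests = make_tests(depth=6, p_positive=0.3, rng=rng)
    (d1, c1), (d2, c2) = coarse_to_fine(tests), fine_to_coarse(tests)
    assert d1 == d2                         # same function, hence same error rates
    cost_ctf, cost_ftc = cost_ctf + c1, cost_ftc + c2
print("mean tests per image:", cost_ctf / 200, "vs", cost_ftc / 200)
```

Both strategies return the same decision on every image, so their error rates coincide; only the number of tests performed differs.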

Statistics  of  Natural  Imagery 

One  cannot  meaningfully  speak  of  the  probability  of  detection  of  a  target  or  of  the 
probability  of  a  false  alarm  without  making  assumptions  about  an  underlying  probability 
distribution  on  natural  scenes.  In  the  language  of  statistics,  this  means  we  need  an 
explicit  null  hypothesis  which  models  generic  backgrounds  or  generic  images. 
Formulating  such  models  has  been  a  major  thrust  of  this  contract.  A  question  of 
fundamental  importance  is  how  to  specify  an  appropriate  mathematical  model  for 
scenes. 

In  summer  1999,  controlled  experiments  were  carried  out  to  collect  and  categorize  a 
database of laser range images in different settings: forest, residential cityscapes, indoor
scenes, and miscellaneous others. Following the data collection, systematic data
analysis  was  performed  to  learn  empirical  distributions  of  a  variety  of  image  statistics. 
The  particular  analyses  done  were  motivated  by  earlier  striking  findings  of  scale- 
invariance  properties  of  reflectance  images.  Finally,  various  mathematical  models,  both 
descriptive  ones  and  generative  ones,  were  proposed  and  analyzed  for  their  potential  to 
adequately  represent  the  empirical  results  for  the  range  image  data  set.  (Descriptive 
models  describe  analytically  the  essential  properties  of  the  low-dimensional  marginal 
distributions  of  the  image  statistics,  while  generative  models  incorporate,  in  addition,  the 
direct  connections  between  those  marginal  distributions  and  the  geometry  and  spatial 
arrangement  of  objects  in  the  3D  world  giving  rise  to  the  distributions.) 

Other  groups  have  started  with  the  assumption  that  wavelet  decompositions  of  images 
are  the  best  way  to  describe  their  statistics.  We  do  not  believe  this  is  true  because  we 
find  complex  dependencies  between  pixels  and  between  wavelet  coefficients  which 
reflect  the  preferred  geometries  of  the  world.  This  can  be  understood  by  contrasting  it 
with  the  (false)  assumption  that  natural  images  are  white  noise:  this  would  say  that  all 
pixels  are  independent  and  there  is  no  geometry  in  scenes.  The  wavelet  point  of  view 
says  there  is  very  simple  geometry  but  it,  too,  can  be  eliminated  by  taking  a  suitable  filter 
basis  for  images.  Our  approach  is  to  look  at  large  numbers  of  local  3  by  3  patches  both 
in  optical  and  range  images,  and  describe  directly  their  full  joint  statistics.  This  was 
begun  in  Huang's  thesis  and  has  been  extended  greatly  by  Lee  and  now  Pedersen,  a 
visiting  graduate  student  from  Denmark.  The  results  are  very  striking:  the  empirical 
distribution  of  high  contrast  3x3  patches  has  a  singular  surface  along  which  it  has  infinite 
density  and  in  whose  neighborhood  about  half  such  patches  are  located.  This  surface  is 
highly  curved  and  represents  images  of  ideal  edges.  Thus  even  a  simple  geometry  in  the 
image  plane,  a  geometry  of  straight  edges,  produces  a  very  complex  cluster  in  the 
vector  space  of  local  images. 
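
The kind of preprocessing involved in such a patch study can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the work described: it takes log intensities, subtracts each patch's mean, measures contrast with the ordinary Euclidean norm, and keeps a fixed top fraction of patches. The function name, the `keep_fraction` parameter, and the toy image are hypothetical.

```python
import math


def high_contrast_patches(image, keep_fraction=0.2, eps=1e-9):
    """Return the highest-contrast 3x3 patches, mean-subtracted and scaled to unit norm.

    image: 2D list of strictly positive intensities (hypothetical input format).
    """
    patches = []
    for r in range(len(image) - 2):
        for c in range(len(image[0]) - 2):
            p = [math.log(image[r + dr][c + dc]) for dr in range(3) for dc in range(3)]
            mean = sum(p) / 9.0
            p = [v - mean for v in p]                 # remove mean brightness
            norm = math.sqrt(sum(v * v for v in p))   # contrast (Euclidean norm, an assumption)
            if norm > eps:                            # skip flat patches
                patches.append((norm, p))
    patches.sort(key=lambda item: item[0], reverse=True)
    n_keep = max(1, int(keep_fraction * len(patches)))
    # After normalization each patch is a point on a unit sphere in the
    # 8-dimensional subspace of zero-mean 9-vectors.
    return [[v / norm for v in p] for norm, p in patches[:n_keep]]


# Toy image with one ideal vertical edge.
image = [[2.0, 2.0, 2.0, 8.0, 8.0, 8.0] for _ in range(6)]
for patch in high_contrast_patches(image):
    print([round(v, 3) for v in patch])
```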

Clutter  Modeling 

A  principal  motivation  for  studying  statistics  of  natural  images  is  the  group’s  objective  of 
developing  realistic  and  usable  mathematical  models  for  clutter  in  natural  images; 
indeed,  the  generative  models  for  natural  image  statistics  mentioned  above  are  clutter 
models.  The  objective  of  understanding  and  modeling  clutter  was  stated  in  the  original 
proposal and received steady attention throughout the project. For example,
Grenander’s work on modeling clutter in SAR imagery was reported in an early review
meeting. In addition to the new work on generative models for image statistics,
Grenander recently introduced the so-called transported generator model. The model is
a simple one and has the potential to form the foundation of more ambitious
clutter-modeling work.

A  variety  of  approaches  to  expand  and  improve  the  models  were  explored,  representing 
different  individual  views.  Grenander  outlined  a  plan  for  attacking  the  clutter-modeling 
challenge — a  plan  that  his  recent  work  instantiated.  Essential  elements  of  the  plan 
include  (1)  experiments  with  real  data,  (2)  design  of  models  tuned  to  different  categories 
of  scenes  (forest,  desert,  cityscape...),  (3)  linkage  of  the  models  to  the  sensor  modality 
(IR,  laser  radar,  SAR,  reflectance...),  and  (4)  using  the  models  for  algorithm  design  and 
performance  analysis. 


Compositional  Models 

Compositionality  refers  to  the  evident  ability  of  humans  to  represent  knowledge  through 
a  hierarchy  of  part-whole  relationships.  This  is  widely  believed  to  be  the  basis  for 
language  representation,  and  many  believe  that  it  is  in  fact  a  fundamental  organizing 
principle  of  cognition.  One  of  the  main  thrusts  of  the  MURI  effort  has  been  to  explore  the 
implications  of  a  compositional  formulation  of  the  machine  vision  problem:  What  are  the 
fundamental  computational  limitations?  What  are  the  implications  for  performance,  in 
terms  of  “learning”  (parameter  estimation)  and  recognition  accuracy? 

The  first  step  is  to  formulate  the  idea  of  compositionality  as  a  precise  mathematical 
theory  of  representation.  This  has  been  largely  achieved,  through  a  rigorous 
probabilistic  formulation,  and  there  is  now  the  beginning  of  a  theory  of  inference 
(learning). Shih-Hsiu Huang, a Ph.D. student, completed a software implementation of a
“composition machine.” The key contribution of the thesis is a coarse-to-fine
computational  strategy  that  yields,  at  any  instant,  an  approximate  image  interpretation. 

A continuum formulation was worked out, and this yielded a surprising connection to
what is known as “scaling” in natural images. The latter refers to the remarkable
observation that the statistics of natural images are essentially invariant to scale:
“blowing up” or “reducing” pictures of natural scenes preserves the statistical structure.
Scaling requires a very specific distribution on the sizes of objects in the image plane.
We discovered that compositional systems require the same distribution in order to show
scaling.
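
A heuristic argument shows why scale invariance constrains the size distribution so tightly. The sketch below rests on assumptions that are not stated in this report: objects are scattered over the image plane by a Poisson process whose intensity depends only on object size, and the symbols f, r, and σ are introduced here purely for illustration.

```latex
% f(r) dr d^2x : expected number of objects with size in [r, r + dr]
%                whose position lies in the area element d^2x.
% Rescaling the image by \sigma > 0 maps r -> \sigma r and x -> \sigma x,
% so invariance of the statistics requires
\[
  f(r)\,dr\,d^2x \;=\; f(\sigma r)\,d(\sigma r)\,d^2(\sigma x)
               \;=\; \sigma^{3} f(\sigma r)\,dr\,d^2x .
\]
% Setting r = 1 gives f(\sigma) = f(1)/\sigma^{3}, that is,
\[
  f(r) \;=\; \frac{C}{r^{3}}, \qquad C = f(1) .
\]
```

Under these assumptions the density of object sizes is a power law with exponent 3. Such a density cannot be normalized over all sizes, so in practice it can hold only between a smallest and a largest size.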

In  the  language  of  formal  grammars,  our  composition  system  amounts  to  a  probabilistic 
treatment  of  context-sensitive  grammars.  This  connection  is  the  basis  for  an  ongoing 
collaboration  with  computational  linguists.  The  principles  of  compositionality  make 
specific  predictions  about  the  nature  of  neuronal  circuitry  and  the  physiological  solution 
to  the  so-called  “binding  problem.”  This  connection  is  the  basis  for  an  ongoing 
collaboration  with  neuroscientists. 

Given  a  (probabilistic)  compositional  model,  the  problem  of  scene  interpretation 
becomes  one  of  assigning  each  element  of  the  scene  to  a  compositional  structure  in 
such a way that the resulting collection of such structures maximizes probability. This is
formally equivalent to the well-studied “covering problem,” which is known to be
NP-complete. This brings the computational aspect of computer vision into sharp focus: the
basic  problem  of  segmentation,  which  is  a  problem  that  must  be  solved,  simultaneously, 
at  each  of  many  levels  of  abstraction  (which  edges  go  together,  which  contours  are 
related,  which  textured  patches  are  part  of  the  same  surface,  what  are  the  object 
delineations,  which  objects  are  related  in  a  larger  “context”...)  amounts  to  the  problem  of 
choosing  a  best  covering,  and  there  is  no  known  general  polynomial-time  solution. 

Of  course,  compositional  systems  have  a  very  special  structure  and,  furthermore,  it  is 
clear  enough  that  natural  vision  systems  continuously  find  good  (perhaps  nearly  optimal) 
interpretations  of  scenes.  What  sort  of  computational  engine  can  attack  the  covering 
problem so effectively? Greedy algorithms cannot work: the problem of “what goes
with what” is locally ambiguous, which essentially rules out incremental optimization.
Monte Carlo methods are universal, but much too slow for high-dimensional problems
with complex structure. Dynamic programming, per se, doesn't apply because there is
no Markov structure: the problem of scene interpretation is fundamentally global.
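
The failure of incremental optimization can be seen even in a tiny covering instance. The sketch below uses unit-cost set cover as a stand-in for the probabilistic covering problem described above, which is a simplifying assumption; the sets and their names are hypothetical.

```python
from itertools import combinations


def greedy_cover(universe, subsets):
    """Repeatedly pick the subset that covers the most still-uncovered elements."""
    uncovered, chosen = set(universe), []
    while uncovered:
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("universe cannot be covered")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


def optimal_cover(universe, subsets):
    """Exhaustive search for a smallest cover (exponential time in general)."""
    target = set(universe)
    for size in range(1, len(subsets) + 1):
        for names in combinations(subsets, size):
            if set().union(*(subsets[n] for n in names)) >= target:
                return list(names)
    return None


universe = range(1, 7)
subsets = {"A": {1, 2, 3, 4}, "B": {1, 2, 5}, "C": {3, 4, 6}}
print("greedy :", greedy_cover(universe, subsets))
print("optimal:", optimal_cover(universe, subsets))
```

The locally most attractive set, A, belongs to no smallest cover, so the greedy choice made at the first step cannot be repaired later without undoing it.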

The scaling properties of natural imagery, and the closely related scaling properties of
formal compositional systems, together with the apparent multi-resolution aspect of
feature detection in natural vision systems, strongly suggest a coarse-to-fine
computation engine. How efficiently can coarse-to-fine processing solve the image
interpretation problem?

This  leads,  more  broadly,  to  the  study  of  coarse-to-fine  computation,  and  to  an  effort  to 
make  precise  the  achievable  gains  in  computational  speed.  These  issues  were  explored 
both  for  compositional  systems  and  for  closely  related  Markov  systems  with  very  large 
state  spaces.  The  latter  amounts  to  an  analysis  of  exact  coarse-to-fine  dynamic 
programming  for  general  graphical  models. 

Dynamic  programming  is  the  basic  computational  engine  behind  the  use  of  hidden 
Markov  models  and  their  generalizations.  Brian  Lucena,  a  Ph.D.  student,  completed  a 
mathematical analysis of a wonderful “coarse-to-fine” approach to dynamic
programming,  introduced  recently  by  one  of  our  former  students,  Chris  Raphael.  Brian  is 
now  working  on  a  new  class  of  computationally  efficient  error-correcting  codes. 
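
The idea of exact coarse-to-fine dynamic programming can be sketched on a small trellis. The code below is a simplified illustration written for this summary, not the algorithm analyzed in the thesis: states at each stage are grouped into blocks, block and arc costs are taken as minima over their members (so they are lower bounds), and only blocks on the current best path are split. The function names and the toy costs are hypothetical, and the brute-force minima inside `block_dp` would need to be replaced by cheap bounds for any real speed-up.

```python
def block_dp(blocks, node_cost, edge_cost):
    """Cheapest path through one block per stage, using optimistic block costs."""
    def nc(t, blk):
        return min(node_cost[t][s] for s in blk)

    def ec(a, b):
        return min(edge_cost(x, y) for x in a for y in b)

    best = [(nc(0, b), [b]) for b in blocks[0]]
    for t in range(1, len(blocks)):
        new = []
        for b in blocks[t]:
            cost, path = min(((c + ec(p[-1], b), p) for c, p in best),
                             key=lambda cp: cp[0])
            new.append((cost + nc(t, b), path + [b]))
        best = new
    return min(best, key=lambda cp: cp[0])


def ctf_dp(node_cost, edge_cost):
    """Refine only the blocks on the current optimal path until all are single states."""
    n_states = len(node_cost[0])
    blocks = [[tuple(range(n_states))] for _ in node_cost]
    while True:
        cost, path = block_dp(blocks, node_cost, edge_cost)
        if all(len(b) == 1 for b in path):
            # Block costs are lower bounds, so a path of single states is optimal.
            return cost, [b[0] for b in path]
        for t, b in enumerate(path):
            if len(b) > 1:
                blocks[t].remove(b)
                mid = len(b) // 2
                blocks[t] += [b[:mid], b[mid:]]


node_cost = [[(3 * t + 5 * s) % 7 for s in range(8)] for t in range(5)]


def edge_cost(a, b):
    return abs(a - b)


fine = [[(s,) for s in range(8)] for _ in node_cost]
print("coarse-to-fine:", ctf_dp(node_cost, edge_cost))
print("ordinary DP   :", block_dp(fine, node_cost, edge_cost)[0])
```

Because every block cost underestimates the cost of the paths it contains, the first time the best block path consists entirely of single states it is also the best path in the full trellis.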

In  ongoing  work,  an  analysis  of  the  fundamental  computational  limitations  inherent  in  the 
vision  problem  will  continue.  Specifically,  coarse-to-fine  algorithms  will  be  the  focus  of 
experiments  and  theoretical  analysis.  Preliminary  experiments  suggest  that,  as  a  rule, 
exponential  speed-ups  are  available  both  in  the  compositional  and  the  Markov  (dynamic 
programming)  settings. 

Learning  and  Recognition 

A  comprehensive  neural  network  model  for  learning,  detecting  and  recognizing  objects 
has  been  developed,  and  applied  to  the  analysis  of  complex  visual  scenes.  Amit  and 
D.  Geman  base  the  network  model  on  the  sparse  binary  feature  representations,  which 
have  been  used  in  the  detection  and  recognition  algorithms  developed.  Learning  is 
based  on  local  Hebbian  learning  rules  and  is  carried  out  in  a  central  module.  Robustness 
to  variations  in  pose  is  obtained  by  using  'complex'  units  that  perform  an  ORing 
operation  over  small  neighborhoods  of  the  input  feature  layers  over  a  coarser  resolution 
array.  Translation  invariant  recognition  and  detection  are  obtained  by  hard  wiring  the 
appropriate  shifting  mechanisms.  Every  shift  of  the  reference  grid  on  the  coarse 
resolution  array  is  copied  to  allow  processing  in  terms  of  interactions  with  the  central 
module.  The  massive  input  from  the  lower  layers  into  higher  layers  is  dealt  with  through 
gating mechanisms. Either a specific set of feature/location pairs is gated to allow for
detection, or a particular shift is gated to allow for classification of the data at a particular
location in the scene.
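
The ORing operation performed by the 'complex' units can be sketched in a few lines. This is a minimal illustration assuming non-overlapping k-by-k neighborhoods and a single binary feature map; the function name and the neighborhood layout are hypothetical, and the network described above may pool differently.

```python
def or_pool(feature_map, k):
    """OR a binary feature map over k-by-k neighborhoods onto a coarser array."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[int(any(feature_map[r][c]
                     for r in range(top, min(top + k, rows))
                     for c in range(left, min(left + k, cols))))
             for left in range(0, cols, k)]
            for top in range(0, rows, k)]


original = [[0, 1, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 1]]
shifted = [[0, 0, 0, 0],
           [1, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 1, 0]]
# A small displacement of each feature leaves the coarse representation unchanged.
print(or_pool(original, 2), or_pool(shifted, 2))
```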

The  ability  to  find  stable  features  of  varying  degrees  of  complexity  on  objects  and  their 
sparsity  in  background  allows  us  to  choose  from  a  family  of  algorithms  according  to 
various  specifications  of  failed  detection  and  false  positive  probabilities.  The  false 
positive  analysis  follows  very  cleanly  from  the  Poisson  statistics,  and  does  not  require 
massive  testing  on  background  images. 
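
To illustrate how a false positive rate follows from Poisson statistics, suppose (as a simplifying assumption made here, not a statement of the actual detector) that the number of features found in a background window is Poisson distributed and that a detection is declared when at least a threshold number of features is present. The function name and the numerical values are hypothetical.

```python
import math


def poisson_tail(threshold, mean):
    """P(N >= threshold) for N ~ Poisson(mean): chance that background alone triggers."""
    below = sum(math.exp(-mean) * mean ** k / math.factorial(k)
                for k in range(threshold))
    return 1.0 - below


# Sparse background features (small mean) make the tail fall off quickly,
# so a threshold can be chosen for a target false positive rate without
# running the detector on large numbers of background images.
for tau in (1, 3, 5, 8):
    print(tau, poisson_tail(tau, mean=0.5))
```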

This approach has been applied to face detection in real scenes, detection of rigid 3D
objects in real scenes, and detection of symbols in highly cluttered, artificially generated
scenes. False positive / false negative curves can be predicted using the statistical
properties in all these cases.

Compression 

Associated with any connected planar region having a sufficiently nice boundary, there is
a set of orthonormal functions that are adapted to the region, namely the eigenfunctions
of the Laplacian. These functions reflect the shape of the region and thus suggest
themselves as a possible basis for representing the image. The merit of such an expansion
can be evaluated in terms of the number of nonzero coefficients required to achieve a
desired level of fidelity. For rectangular regions this specializes to the usual Fourier
basis. Of course, Fourier series are not particularly useful for non-rectangular regions, and
this is the natural modification. Our experimental work supports the view that this
approach can significantly improve the bits per pixel, and it has been suggested that the
coding scheme developed may be consistent with the most recent MPEG standards.
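
The rectangular special case can be written out explicitly. The boundary condition is not specified above; the sketch below assumes Neumann (zero normal derivative) conditions, and the symbols a, b, m, n are introduced here for illustration.

```latex
% Region R = [0,a] x [0,b], eigenproblem  -\Delta \phi = \lambda \phi  in R,
% with \partial\phi / \partial n = 0 on the boundary (an assumption).
\[
  \phi_{mn}(x,y) = \cos\!\Big(\frac{m\pi x}{a}\Big)\cos\!\Big(\frac{n\pi y}{b}\Big),
  \qquad
  \lambda_{mn} = \pi^{2}\Big(\frac{m^{2}}{a^{2}} + \frac{n^{2}}{b^{2}}\Big),
  \qquad m, n = 0, 1, 2, \dots
\]
% An image f on R is then expanded in this orthogonal family as
\[
  f = \sum_{m,n} c_{mn}\,\phi_{mn},
  \qquad
  c_{mn} = \frac{\langle f, \phi_{mn}\rangle}{\|\phi_{mn}\|^{2}} .
\]
```

With Dirichlet conditions the cosines are replaced by sines and the indices start at 1. For a non-rectangular region the eigenfunctions generally have no closed form and would have to be computed numerically.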

Variational  Approach  to  Bayesian  Estimation 

Sanjoy  Mitter  in  joint  work  with  Nigel  Newton  of  the  University  of  Essex  developed  a 
Variational  Theory  for  Bayesian  Estimation  which  characterizes  the  conditional 
distribution  as  the  solution  of  a  variational  problem  of  minimizing  a  certain  Free  Energy. 
This  theory  is  very  general  and  applies  to  Hidden  Markov  Models  based  on  a  Markov 
Random  Field.  This  research  makes  non-trivial  connections  to  recent  work  on  Inference 
on  Graphs,  Coding  theory  and  Non-Equilibrium  Statistical  Mechanics.  It  has  enabled  us 
to  solve  the  long-standing  open  problem  of  giving  a  Variational  View  of  Non-linear 
Filtering.  Several  papers  describing  this  work  are  in  preparation. 
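
The flavor of such a variational characterization can be conveyed by the classical Gibbs variational principle, stated here in generic notation as an illustration; the notation and the precise setting of the theory described above may differ. The symbols P, Q, H, and Z are introduced for this sketch.

```latex
% P       : prior distribution of the unknown X
% H(x)    : -\log p(y \mid x), the "energy" attached to the observation y
% P^{y}   : posterior,  P^{y}(dx) = e^{-H(x)} P(dx) / Z,  with  Z = \int e^{-H}\,dP = p(y)
% For any probability measure Q absolutely continuous with respect to P, define
\[
  F(Q) \;=\; \mathbb{E}_{Q}[H] \;+\; D(Q \,\|\, P) .
\]
% Expanding the relative entropy with respect to the posterior gives
\[
  D(Q \,\|\, P^{y}) \;=\; D(Q \,\|\, P) + \mathbb{E}_{Q}[H] + \log Z
  \quad\Longrightarrow\quad
  F(Q) \;=\; D(Q \,\|\, P^{y}) \;-\; \log Z .
\]
% Since D(Q \| P^{y}) \ge 0 with equality only when Q = P^{y},
% the free energy F is minimized by the posterior, and the minimum is -\log p(y).
```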

Temporal  Information  in  Recognition 

We  developed  a  coherent  statistical/Bayesian  framework  for  tracking  of  moving  objects 
in  highly  cluttered  environments  on  the  basis  of  video  image  sequences.  The  framework 
involves three basic components: (i) an object representation, i.e., a model that
articulates the overall shape architecture of an object together with the object's random
deformations  (rigid  and  non-rigid);  (ii)  a  dynamic  model,  i.e.  a  prior  on  the  set  of  possible 
trajectories  of  a  moving  object;  and  (iii)  an  observation  model  that  relates,  at  each  video 
frame,  the  image  gray-level  data  to  the  object  and  dynamic  models,  and  articulates  the 
random  variability  of  the  image  data  due  to  clutter,  occlusion,  and  other  image 
degradation.  The  combination  of  these  three  components  leads  to  a  nonlinear  filtering 
problem  which  is  equivalent  to  a  Hidden  Markov  Model  (HMM).  We  have  explored  two 
object  representations  (a  deformable  template  model,  and  a  hierarchical  syntactic 
model), and two observation models: a nonlinear one that explores the HMM
representation  of  the  filtering  problem,  and  a  linear  one  that  employs  the 
hierarchical/syntactic  models.  The  nonlinear  observation  model  is  combined  with  a 
Monte  Carlo  based  tracking  algorithm  and  runs  in  real  time.  The  linear  observation 
model  is  combined  with  the  Extended  Kalman  Filter  (EKF),  but  at  the  present  time  it 
does  not  run  in  real  time.  Our  experiments  demonstrate  that  the  Monte  Carlo  filter 
performs  considerably  better  than  the  EKF  in  cluttered  environments;  the  performance  of 
the  two  filters  is  comparable  in  environments  with  limited  image  degradation. 
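
The interplay of the dynamic model and the observation model in a Monte Carlo filter can be sketched with a bootstrap particle filter, one common Monte Carlo filtering scheme. This is a toy illustration and not the tracker described above: the state is a single scalar position, the dynamic model is a random walk, and clutter is modeled by a flat floor added to a Gaussian likelihood. All names and parameter values are hypothetical.

```python
import math
import random


def particle_filter(observations, n_particles=500, step_sd=0.5, obs_sd=1.0,
                    clutter_weight=0.05, seed=0):
    """Return the posterior-mean position estimate after each observation."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # Dynamic model: random-walk prior on the trajectory.
        particles = [x + rng.gauss(0.0, step_sd) for x in particles]
        # Observation model: Gaussian around the state plus a flat clutter floor,
        # so an outlying measurement cannot drive every weight to zero.
        weights = [(1.0 - clutter_weight) * math.exp(-0.5 * ((y - x) / obs_sd) ** 2)
                   + clutter_weight for x in particles]
        total = sum(weights)
        estimates.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # Resample particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# Slowly drifting target with one gross outlier standing in for clutter.
observations = [0.2 * t for t in range(15)]
observations[7] = 25.0
for y, est in zip(observations, particle_filter(observations)):
    print(round(y, 2), round(est, 2))
```

Because the likelihood is not Gaussian once the clutter floor is added, the posterior need not be unimodal. A sample-based representation can carry such a posterior, while a filter that propagates a single Gaussian cannot.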

Data-Driven  Performance  Optimization 

The guiding principle of minimum description length has been espoused as a means of
inferring  compositional  structure  in  scenes  by  members  of  the  group  from  all  five 
participating  universities.  R.  Brackett  and  colleagues  at  Harvard  have  developed  ways 
of  using  feedback  about  the  gradient  of  description  length  (or  other  measures  of  the 
complexity  of  the  description  of  an  image)  during  an  iterative  computation  as  a  way  of 
tuning  or  adapting  a  vision  algorithm’s  parameters  optimally  for  the  particular  scene 
being processed. The use of feedback in this setting is reminiscent of the use of
feedback loops in control theory. The versatility of the approach to data-driven
performance optimization has been demonstrated on algorithms for (1) region-based
image compression and (2) extraction of coherent structure from highly cluttered scenes.